Test Refiner¶

A sample of 10 clusters was taken. From each cluster, a subsample of N sequences was recovered, with N:

  • 100, 150, 200, 250, 300, 350, 400, 450, 500

Then a second subsampling was done taking sequence length into account: a sequence is kept if its length is at least len% of the maximum sequence length in the cluster (again capped at 500 sequences), with len%:

  • 20%, 40%, 60%, 80%, 100%
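The length-based subsampling can be sketched as follows; `subsample_by_length`, `min_frac` and `max_n` are illustrative names, not part of the actual pipeline:

```python
# Hypothetical sketch of the length-threshold subsampling described above:
# keep sequences whose length is at least `min_frac` of the cluster's
# longest sequence, capped at `max_n` sequences.
def subsample_by_length(seq_lengths, min_frac, max_n=500):
    longest = max(seq_lengths)
    kept = [l for l in seq_lengths if l >= min_frac * longest]
    return kept[:max_n]

print(subsample_by_length([100, 250, 480, 500, 90], 0.8))  # → [480, 500]
```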
In [1]:
from glob import glob
import pandas as pd
import re
from Bio import SeqIO
import Levenshtein as lv
import plotly.express as px
In [2]:
work_dir = "/home/lerat/TE_Aalb/test/cdhit"
sampled_clst_dir = sorted(list(glob(work_dir+"/cluster_*", recursive = True)))
In [3]:
clst_type = ["100", "150", "200", "250", "300", "350", "400", "450", "500", "20%", "40%", "60%", "80%", "100%"]
info = ["nb_seq", "cons_length", "div"]

columns = ["cluster"]
for t in clst_type:
    for i in info:
        columns.append(i + "-" + t)

refiner = pd.DataFrame(columns=columns)
In [4]:
sizes = [100, 150, 200, 250, 300, 350, 400, 450, 500]
lengths = [2, 4, 6, 8, 10]  # length thresholds in tenths: 2 -> 20%, ..., 10 -> 100%

for clst in sampled_clst_dir:
    row = []
    
    cluster_name = clst.split("/")[-1]
    row.append(cluster_name)
    
    # reference consensus built from the 500-sequence sample
    cons_500 = clst + "/cluster-500.clst.fasta.refiner_cons"
    for seq_record in SeqIO.parse(cons_500, "fasta"):
        seq_500 = seq_record.seq
    
    # size-based samples: sequence count, consensus length, distance to the 500-seq consensus
    for s in sizes:
        sample_name = clst + "/cluster-" + str(s) + ".clst.fasta"
        
        nb_seq = 0
        with open(sample_name, "r") as f:
            data = f.read()
            nb_seq = len(re.findall(r'>', data))  # count FASTA headers
            row.append(nb_seq)
        
        cons_file = clst + "/cluster-" + str(s) + ".clst.fasta.refiner_cons"
        
        for seq_record in SeqIO.parse(cons_file, "fasta"):
            cons_len = len(seq_record.seq)
            row.append(cons_len)
            dist = 1 - lv.ratio(str(seq_500), str(seq_record.seq))
            row.append(dist)
    
    # length-threshold samples: same three metrics per threshold
    for l in lengths:
        sample_name = clst + "/cluster-" + str(l) + ".clst.fasta"
        
        nb_seq = 0
        with open(sample_name, "r") as f:
            data = f.read()
            nb_seq = len(re.findall(r'>', data))  # count FASTA headers
            row.append(nb_seq)
        
        cons_file = clst + "/cluster-" + str(l) + ".clst.fasta.refiner_cons"
        
        for seq_record in SeqIO.parse(cons_file, "fasta"):
            cons_len = len(seq_record.seq)
            row.append(cons_len)
            dist = 1 - lv.ratio(str(seq_500), str(seq_record.seq))
            row.append(dist)

    refiner.loc[-1] = row  # prepend the new row
    refiner.index = refiner.index + 1
    refiner = refiner.sort_index()
    
refiner = refiner.set_index('cluster')
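The `loc[-1]` / index-shift / `sort_index` pattern above prepends each new row, which is why cluster_9 appears first in the table below. An equivalent and simpler approach, assuming the same column layout, is to collect rows in a list and build the DataFrame once (the row values here are made up for illustration):

```python
import pandas as pd

# Collect rows, then build the frame in one call instead of prepending
# row by row with loc[-1] and index shifting.
rows = [["cluster_1", 10], ["cluster_0", 20]]
df = pd.DataFrame(rows, columns=["cluster", "nb_seq"]).set_index("cluster")
print(df.loc["cluster_1", "nb_seq"])  # → 10
```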

Dataset description¶

Each row is a sampled cluster. Each cluster was sampled with 100, 150, 200, 250, 300, 350, 400, 450 and 500 sequences. For each sample we computed the number of sequences (which can be lower than expected if the cluster is smaller than the sample size), the length of the consensus generated by Refiner, and the distance between that consensus and the one generated from the 500-sequence sample. Each cluster was then sampled by sequence length, keeping sequences that are at least 20%, 40%, 60%, 80% or 100% of the length of the longest sequence in the cluster, up to 500 sampled sequences (when possible). As before, we computed the number of sequences, the consensus length, and the distance to the 500-sequence sample.
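The `div` columns are 1 minus a normalized similarity ratio, so identical consensuses give 0.0 and the value grows toward 1.0 as they diverge. The notebook uses `Levenshtein.ratio`; as a self-contained illustration, the stdlib `difflib` yields the same similarity (0.875) on this single-substitution pair:

```python
import difflib

# div = 1 - similarity ratio: 0.0 for identical sequences, approaching
# 1.0 as they diverge. difflib stands in here for Levenshtein.ratio;
# both give 0.875 similarity on this example pair.
a = "ACGTACGT"
b = "ACGTTCGT"  # one substitution
print(1 - difflib.SequenceMatcher(None, a, b).ratio())  # → 0.125
```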

In [5]:
refiner
Out[5]:
nb_seq-100 cons_length-100 div-100 nb_seq-150 cons_length-150 div-150 nb_seq-200 cons_length-200 div-200 nb_seq-250 ... div-40% nb_seq-60% cons_length-60% div-60% nb_seq-80% cons_length-80% div-80% nb_seq-100% cons_length-100% div-100%
cluster
cluster_9 100 446 0.126984 150 446 0.126984 200 446 0.130952 250 ... 0.791651 2 4636 0.784148 2 4636 0.784148 1 4805 0.791690
cluster_8 100 202 0.212871 150 202 0.000000 200 202 0.000000 250 ... 0.080569 1 543 0.476510 1 543 0.476510 1 543 0.476510
cluster_7 100 1403 0.076156 150 1493 0.056506 200 1492 0.057494 250 ... 0.021529 1 3509 0.410131 1 3509 0.410131 1 3509 0.410131
cluster_6 100 900 0.001668 119 899 0.000000 119 899 0.000000 119 ... 0.007795 37 897 0.007795 12 897 0.008909 1 899 0.414905
cluster_5 100 147 0.000000 150 147 0.000000 200 147 0.000000 250 ... 0.944064 1 5109 0.944064 1 5109 0.944064 1 5109 0.944064
cluster_4 100 147 0.314149 150 147 0.314149 200 260 0.022642 250 ... 0.473451 1 625 0.472626 1 625 0.472626 1 625 0.472626
cluster_3 100 1433 0.612737 150 1869 0.532497 200 1862 0.533615 250 ... 0.060198 1 5947 0.066420 1 5947 0.066420 1 5947 0.066420
cluster_2 100 1562 0.001601 127 1561 0.000000 127 1561 0.000000 127 ... 0.000000 16 1603 0.014539 5 1610 0.016083 1 1608 0.026191
cluster_1 100 1164 0.493554 126 3180 0.000000 126 3180 0.000000 126 ... 0.021760 1 3162 0.021760 1 3162 0.021760 1 3162 0.021760
cluster_0 100 4102 0.393554 150 6571 0.004099 150 6571 0.004099 150 ... 0.028089 3 6534 0.028089 1 6527 0.029855 1 6527 0.029855

10 rows × 42 columns

The number of sampled sequences¶

In [6]:
sampled_seq = refiner[['nb_seq-100', 'nb_seq-150', 'nb_seq-200', 'nb_seq-250', 'nb_seq-300', 'nb_seq-350', \
                       'nb_seq-400', 'nb_seq-450', 'nb_seq-500', 'nb_seq-20%', 'nb_seq-40%', 'nb_seq-60%', \
                       'nb_seq-80%', 'nb_seq-100%']]
In [8]:
sampled_seq
Out[8]:
nb_seq-100 nb_seq-150 nb_seq-200 nb_seq-250 nb_seq-300 nb_seq-350 nb_seq-400 nb_seq-450 nb_seq-500 nb_seq-20% nb_seq-40% nb_seq-60% nb_seq-80% nb_seq-100%
cluster
cluster_9 100 150 200 250 300 350 400 450 500 4 4 2 2 1
cluster_8 100 150 200 250 256 256 256 256 256 236 3 1 1 1
cluster_7 100 150 200 250 300 350 400 450 500 593 6 1 1 1
cluster_6 100 119 119 119 119 119 119 119 119 45 40 37 12 1
cluster_5 100 150 200 250 300 350 400 450 500 1 1 1 1 1
cluster_4 100 150 200 250 300 350 372 372 372 128 38 1 1 1
cluster_3 100 150 200 250 300 350 400 450 451 9 4 1 1 1
cluster_2 100 127 127 127 127 127 127 127 127 123 109 16 5 1
cluster_1 100 126 126 126 126 126 126 126 126 15 1 1 1 1
cluster_0 100 150 150 150 150 150 150 150 150 37 3 3 1 1

The consensus length¶

In [9]:
cons_length = refiner[['cons_length-100', 'cons_length-150', 'cons_length-200', 'cons_length-250', \
                       'cons_length-300', 'cons_length-350', 'cons_length-400', 'cons_length-450', \
                       'cons_length-500', 'cons_length-20%', 'cons_length-40%', 'cons_length-60%', \
                       'cons_length-80%', 'cons_length-100%']]
cons_length = cons_length.T
In [10]:
cons_length
Out[10]:
cluster cluster_9 cluster_8 cluster_7 cluster_6 cluster_5 cluster_4 cluster_3 cluster_2 cluster_1 cluster_0
cons_length-100 446 202 1403 900 147 147 1433 1562 1164 4102
cons_length-150 446 202 1493 899 147 147 1869 1561 3180 6571
cons_length-200 446 202 1492 899 147 260 1862 1561 3180 6571
cons_length-250 444 202 1495 899 147 270 2678 1561 3180 6571
cons_length-300 562 202 1494 899 147 270 2810 1561 3180 6570
cons_length-350 562 202 1343 899 147 270 2810 1561 3180 6604
cons_length-400 562 202 1343 899 147 270 2809 1561 3180 6572
cons_length-450 562 202 1343 899 147 270 3244 1561 3180 6571
cons_length-500 562 202 1604 899 147 270 5947 1561 3180 6603
cons_length-20% 4804 202 1601 897 5109 147 5948 1561 3169 6704
cons_length-40% 4804 220 1601 897 5109 634 5947 1561 3162 6534
cons_length-60% 4636 543 3509 897 5109 625 5947 1603 3162 6534
cons_length-80% 4636 543 3509 897 5109 625 5947 1610 3162 6527
cons_length-100% 4805 543 3509 899 5109 625 5947 1608 3162 6527
In [11]:
fig = px.line(cons_length)
fig.show()

The distance to the 500-sequence sample¶

In [12]:
cons_div = refiner[['div-100', 'div-150', 'div-200', 'div-250', 'div-300', 'div-350', 'div-400', 'div-450', \
                    'div-500', 'div-20%', 'div-40%', 'div-60%', 'div-80%', 'div-100%']]
cons_div = cons_div.T
In [13]:
cons_div
Out[13]:
cluster cluster_9 cluster_8 cluster_7 cluster_6 cluster_5 cluster_4 cluster_3 cluster_2 cluster_1 cluster_0
div-100 0.126984 0.212871 0.076156 0.001668 0.000000 0.314149 0.612737 0.001601 0.493554 0.393554
div-150 0.126984 0.000000 0.056506 0.000000 0.000000 0.314149 0.532497 0.000000 0.000000 0.004099
div-200 0.130952 0.000000 0.057494 0.000000 0.000000 0.022642 0.533615 0.000000 0.000000 0.004099
div-250 0.379722 0.000000 0.054534 0.000000 0.000000 0.003704 0.444638 0.000000 0.000000 0.003188
div-300 0.001779 0.000000 0.054229 0.000000 0.000000 0.003704 0.363252 0.000000 0.000000 0.003264
div-350 0.003559 0.000000 0.098744 0.000000 0.000000 0.000000 0.362567 0.000000 0.000000 0.000681
div-400 0.005338 0.000000 0.098744 0.000000 0.000000 0.000000 0.361809 0.000000 0.000000 0.004023
div-450 0.000000 0.000000 0.100102 0.000000 0.000000 0.000000 0.296268 0.000000 0.000000 0.004099
div-500 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
div-20% 0.791651 0.000000 0.003432 0.007795 0.944064 0.309353 0.058092 0.000000 0.018113 0.039603
div-40% 0.791651 0.080569 0.021529 0.007795 0.944064 0.473451 0.060198 0.000000 0.021760 0.028089
div-60% 0.784148 0.476510 0.410131 0.007795 0.944064 0.472626 0.066420 0.014539 0.021760 0.028089
div-80% 0.784148 0.476510 0.410131 0.008909 0.944064 0.472626 0.066420 0.016083 0.021760 0.029855
div-100% 0.791690 0.476510 0.410131 0.414905 0.944064 0.472626 0.066420 0.026191 0.021760 0.029855
In [14]:
fig = px.line(cons_div)
fig.show()